Using Bag of Words (BOW) and Standard Deviations to Represent Expected Structures for Document Retrieval: A Way of Thinking that Leads to Method Choices

نویسندگان

  • Harvey S. Hyman
  • Warren Fridy
چکیده

Summary This paper discusses a theory and proposed design for a retrieval artifact using a bag of words (BOW) approach for terms and a standard deviation method for assigning weights. The paper is organized into three parts. The first part is an historical overview of the development of text mining techniques. It is intended to acquaint the reader with our perspectives and assumptions; it is not intended to be an exhaustive review of the literature. The second part discusses the application of text mining techniques to the legal domain, specifically eDiscovery. The third part describes our approach to the 2010 TREC Legal Track problem set #301. Part Three is sub-divided into three sections. Section one is a discussion of our approach to document retrieval. Section two is a description of our design approach and method specifically chosen for Problem #301. Section three discusses contributions, limitations, generalizability, and conclusions based on our experience with the initial results produced by the competing artifacts in the TREC proceeding. We include a discussion of the initial results as reported at the conference.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

A Novel Method for Content Base Image Retrieval Using Combination of Local and Global Features

Content-based image retrieval (CBIR) has been an active research topic in the last decade. In this paper we proposed an image retrieval method using global and local features. Firstly, for local features extraction, SURF algorithm produces a set of interest points for each image and a set of 64-dimensional descriptors for each interest points and then to use Bag of Visual Words model, a cluster...

متن کامل

A Novel Method for Content Base Image Retrieval Using Combination of Local and Global Features

Content-based image retrieval (CBIR) has been an active research topic in the last decade. In this paper we proposed an image retrieval method using global and local features. Firstly, for local features extraction, SURF algorithm produces a set of interest points for each image and a set of 64-dimensional descriptors for each interest points and then to use Bag of Visual Words model, a cluster...

متن کامل

Improved language modelling using bag of word pairs

The bag-of-words (BoW) method has been used widely in language modelling and information retrieval. A document is expressed as a group of words disregarding the grammar and the order of word information. A typical BoW method is latent semantic analysis (LSA), which maps the words and documents onto the vectors in LSA space. In this paper, the concept of BoW is extended to Bag-of-Word Pairs (BoW...

متن کامل

Improving Performances of BoW-based Image Retrieval by Using Contextual Keypoint Descriptors

The paper reports an improved method of content-based image retrieval using a well-known method of bag-of words (BoW). Words built over descriptors of popular affine-invariant keypoint detectors (Harris-Affine and Hessian-Affine are exemplary choices) are used. What is novel, however, is the number of descriptors (i.e. the number of words) representing individual keypoints. Instead of SIFT (or ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010